[llama] Store KV Cache on CPU and Use PyTorch `SPDA` for Next token generation #1182

zhentaoyu · 2024-08-02T02:23:57Z

What does this PR do?

Results

python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --max_new_tokens 4096 --bf16 --use_kv_cache --attn_softmax_bf16 --reuse_cache --do_sample --prompt "Tell me somethings about Intel"

with --kv_cache_on_host

```bash
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 2.132539697795915 tokens/second
Number of HPU graphs                = 14
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.77 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 5842.699780527037 seconds
--------------------------------------------------------------------------------------------------------------
```

update 4b0fa1a

Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 12.22449896564133 tokens/second
Number of HPU graphs                = 0
Memory allocated                    = 12.68 GB
Max memory allocated                = 12.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 1010.5770402610069 seconds
--------------------------------------------------------------------------------------------------------------

without --kv_cache_on_host

Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 31.41817953959749 tokens/second
Number of HPU graphs                = 11
Memory allocated                    = 14.68 GB
Max memory allocated                = 14.68 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 397.36551256105304 seconds
--------------------------------------------------------------------------------------------------------------
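The 2 GB gap in allocated memory between the runs with and without `--kv_cache_on_host` is roughly what a back-of-the-envelope estimate of the KV cache size predicts. The sketch below assumes the published Llama-2-7B configuration (32 layers, 32 key/value heads, head dimension 128) and bf16 storage. `kv_cache_bytes` is a hypothetical helper for illustration, not part of optimum-habana, and it ignores the handful of prompt tokens.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes (defaults: assumed Llama-2-7B config, bf16)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * num_tokens * batch_size)


per_token = kv_cache_bytes(1)
total = kv_cache_bytes(4096)  # matches --max_new_tokens 4096
print(f"{per_token / 2**20:.2f} MiB per token")      # 0.50 MiB per token
print(f"{total / 2**30:.2f} GiB for 4096 tokens")    # 2.00 GiB for 4096 tokens
```

The stats output does not say whether "GB" means decimal or binary units, so the match with the 14.68 GB vs 12.68 GB figures should be read as approximate.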

Limitations

Cannot generate correct results with --use_hpu_graphs, because the self-attention forward pass performs host-device memory transfers.

cc @airMeng and @luoyu-intel

Update

Yi-34b-chat on gaudi-2 with ~11k input + 5k output
command:

python run_generation.py \
--model_name_or_path 01-ai/Yi-34B-Chat \
--use_kv_cache \
--bf16 \
--attn_softmax_bf16 \
--reuse_cache \
--do_sample \
--dataset_name emozilla/pg19-test \
--batch_size 1 \
--max_input_tokens 11200 \
--column_name "text" \
--dataset_max_samples 1 \
--warmup 0 \
--n_iterations 1 \
--max_new_tokens 5000 \
--kv_cache_on_host

without kv_cache_on_host:

 09/18/2024 05:28:11 - INFO - __main__ - Graph compilation...
Traceback (most recent call last):
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 707, in <module>
    main()
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 655, in main
    generate_dataset(batch)
  File "/data/optimum-habana/examples/text-generation/run_generation.py", line 633, in generate_dataset
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1299, in generate
    result = self._sample(
  File "/data/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2239, in _sample
    self.htcore_generation.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 26, in wrapper
    func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 66, in mark_step
    htcore._mark_step(device_str, sync)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER workspace Allocation of size ::28127918336 failed!

with kv_cache_on_host:

Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------

Enlarged output token count with kv_cache_on_host:
--max_input_tokens 11200 --max_new_tokens 10000

Stats:
----------------------------------------------------------------------
Throughput (including tokenization) = 1.2790787964372536 tokens/second
Total runtime for dataset: 3909.073683977127
Memory allocated                    = 90.72 GB
Max memory allocated                = 91.63 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3907.185397926951 seconds
----------------------------------------------------------------------

airMeng · 2024-08-02T02:25:27Z

@hshen14 @luoyu-intel for awareness

airMeng · 2024-08-07T01:04:24Z

@mandy-li @libinta @dvarshney-habana This is the first system-optimization PR from the Intel Neural Compressor (INC) team. Could you review it?

Experiments with Llama2 on a single Gaudi2 card with a Xeon 8380 host. By offloading the KV cache and SDPA to the CPU, we raise the context limit from 26k (input: 10k + output: 16k) to 310k (input: 10k + output: 300k).

| Config | Context (input + output) | HPU memory (GB, steady/peak) | CPU memory (GB) |
|---|---|---|---|
| KV cache on HPU | 10k+16k | ~90 | N/A |
| KV cache on HPU | 10k+100 | 83.36/84.11 | 4.4 |
| KV cache on HPU | 12k+100 | 91.78/92.72 | 5.03 |
| KV cache on HPU | 12k+10k | 92.06/93.0 | 7.68 |
| KV cache on HPU | 12k+100k | OOM | N/A |
| KV cache on HPU | 10k+100k | 86.22/86.97 | 31 |
| KV cache on HPU | 10k+300k | 91.94/92.70 | 85 |

emascarenhas · 2024-09-03T14:51:04Z

Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

zhentaoyu · 2024-09-04T07:10:01Z

> Please sync your PR with main/upstream and fix any merge conflicts. Thank you.

done.

imangohari1 · 2024-09-10T19:39:44Z

@zhentaoyu
Thanks for the PR and the results in description.
Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?

This PR also has a merge conflict with main; could you please take a look at the differences?
We need to test this PR with the CI system to make sure it does not break anything or affect performance.

zhentaoyu · 2024-09-11T02:24:51Z

> @zhentaoyu Thanks for the PR and the results in the description. Do I read this correctly that using the KV cache on host degrades throughput and also does not generate correct answers with HPU graphs? If so, what is the use of this option?
>
> This PR also has a merge conflict with main; could you please take a look at the differences? We need to test this PR with the CI system to make sure it does not break anything or affect performance.

Yes. It is an option for long-context inference or generation when a single HPU card runs out of memory. In this PR, I simply use torch.Tensor.to to transfer the KV-cache-related tensors between the CPU and Gaudi2, and I run the next-token SDPA on the CPU to save data transfer time. However, it cannot generate the right answer with --use_hpu_graphs. I'm not familiar with Habana Synapse graphs; please tell me if you have any insights, and I'll be happy to try to fix it.
OK, I have rebased the PR.
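As a rough illustration of the approach described in this comment, the sketch below keeps the KV cache on the CPU and runs single-token attention there, so only the new query/key/value and the attention output cross the host-device boundary. It is not the PR's actual code: `decode_step_attention` and its tensor layout are assumptions made for illustration, and the "accelerator" falls back to the CPU so the snippet runs without a Gaudi device.

```python
import torch
import torch.nn.functional as F


def decode_step_attention(q_new, k_new, v_new, k_cache, v_cache, token_idx):
    """One decode step with the KV cache held on the CPU (hypothetical helper).

    q_new, k_new, v_new: [batch, heads, 1, head_dim] on the accelerator.
    k_cache, v_cache:    [batch, heads, max_len, head_dim] on the CPU.
    token_idx:           position of the new token in the cache.
    """
    accel = q_new.device
    # Only single-token tensors are copied to the host; the cache never moves.
    q_cpu = q_new.to("cpu")
    k_cache[:, :, token_idx:token_idx + 1] = k_new.to("cpu")
    v_cache[:, :, token_idx:token_idx + 1] = v_new.to("cpu")
    # Attend over every position filled so far. With one query token, all
    # cached keys are at or before its position, so no causal mask is needed.
    keys = k_cache[:, :, :token_idx + 1]
    values = v_cache[:, :, :token_idx + 1]
    out = F.scaled_dot_product_attention(q_cpu, keys, values)
    return out.to(accel)


# Demo: "accel" is the CPU here; on Gaudi it would be the HPU device.
accel = torch.device("cpu")
batch, heads, max_len, head_dim = 1, 4, 32, 16
k_cache = torch.zeros(batch, heads, max_len, head_dim)
v_cache = torch.zeros(batch, heads, max_len, head_dim)
for token_idx in range(3):
    q, k, v = (torch.randn(batch, heads, 1, head_dim, device=accel) for _ in range(3))
    out = decode_step_attention(q, k, v, k_cache, v_cache, token_idx)
print(out.shape, out.device)
```

The per-step transfer is a few kilobytes per layer, whereas moving the whole cache to the device would grow linearly with context length. This is the trade-off the comment refers to as saving data transfer time.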

zhentaoyu · 2024-09-12T08:45:37Z

Hi @imangohari1, I have updated the PR (see the description). Could you please take another look when you have time? Please let me know if you have more comments or need more tests. Thanks a lot.
cc @hshen14

yeonsily · 2024-09-17T22:00:27Z

optimum/habana/transformers/generation/utils.py

+                    else:
+                        unwrap_deepspeed_model(self).allocate_kv_cache(
+                            bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens
+                        )


For lines 1096 to 1107, I would like to suggest changing the code like this:

    if not is_greedy_or_beam_and_bucket:
        cache_device = "hpu"
        if generation_config.kv_cache_on_host and self.config.model_type in ["llama"]:
            print("Allocate KV Cache on CPU...")
            cache_device = "cpu"
        unwrap_deepspeed_model(self).allocate_kv_cache(
            bs * generation_config.num_beams, calculated_max_length, token_idx + num_virtual_tokens,
            device=cache_device
        )

Thanks, I have updated it in 74e94ff. However, I cannot remove the else branch, because I only modified modeling_llama.py for this experimental feature.

yeonsily · 2024-09-17T22:03:38Z

@zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."?
The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

zhentaoyu · 2024-09-18T08:37:45Z

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."? The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

Hi @yeonsily, thanks for your comment. Yes, I have added a case to the README and updated the results in the PR description.

yeonsily · 2024-09-18T21:10:35Z

optimum/habana/transformers/models/llama/modeling_llama.py

-                else:
-                    with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
+        else:
+            if kv_cache_on_host:


Can you please explain in which case the kv_cache device is switched? I thought line 656 applies only in the case of line 658.

In this PR, we store the KV cache on the CPU and run SDPA on the CPU only when generating the next token. The first-token (prefill) stage runs on the HPU because of its strong compute capability in long-context scenarios (a long prompt in most cases). The full pipeline diagram is shown in the PR description.
So line 658 says that PyTorch CPU SDPA (flash-attn) is used only when kv_cache_on_host is set, during next-token generation, and in the inference stage. Otherwise, the KV cache is transferred to the HPU device if needed for its original operations.
Please let me know if you need more explanation or have any suggestions. Thanks.
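The dispatch rule described in this reply can be sketched as a small predicate. This is an illustrative reading of the comment, not the code in modeling_llama.py: `choose_sdpa_device` is a hypothetical name, and treating a query length of 1 as "next-token generation" is an assumption.

```python
def choose_sdpa_device(kv_cache_on_host: bool, q_len: int, training: bool) -> str:
    """Return where attention runs for this forward pass (hypothetical helper)."""
    # Assumption: a single query token means we are past prefill.
    is_next_token = q_len == 1
    if kv_cache_on_host and is_next_token and not training:
        # Cache already lives on the CPU, so attend there.
        return "cpu"
    # Prefill, training, or the default path: compute on the HPU,
    # moving the KV cache to the device first if it is on the host.
    return "hpu"


print(choose_sdpa_device(True, q_len=11200, training=False))  # hpu (prefill)
print(choose_sdpa_device(True, q_len=1, training=False))      # cpu (next token)
print(choose_sdpa_device(False, q_len=1, training=False))     # hpu (flag off)
```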

airMeng · 2024-10-29T05:06:44Z

> @zhentaoyu Do you have a use case for "It's an option for long-context inference or generation when a single hpu card OOM."? The README example is Llama 7B, and we don't see an advantage for this run. It would be good to include a real example.

@yeonsily A similar feature is already available in TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/kv_cache_reuse.html#offloading-to-host-memory

zhentaoyu force-pushed the cpu_sdpa branch from 554b8ac to d5c06c7 Compare August 6, 2024 07:52

zhentaoyu marked this pull request as ready for review August 8, 2024 01:20

zhentaoyu requested review from ssarkar2, bhargaveede, vivekgoe, mandy-li, libinta, dvarshney-habana and regisss as code owners August 8, 2024 01:20

zhentaoyu force-pushed the cpu_sdpa branch from d5c06c7 to 928ab58 Compare August 9, 2024 02:13

zhentaoyu force-pushed the cpu_sdpa branch from 928ab58 to 7ca2c8f Compare September 4, 2024 02:18

zhentaoyu force-pushed the cpu_sdpa branch from 7ca2c8f to ff3c54f Compare September 11, 2024 02:15

zhentaoyu force-pushed the cpu_sdpa branch from ff3c54f to 4b0fa1a Compare September 12, 2024 06:13

yeonsily reviewed Sep 17, 2024

zhentaoyu added 4 commits September 18, 2024 02:03

cpu_kv and cpu_sdpa on llama

aee4795

Signed-off-by: Yu Zhentao <[email protected]>

refact code and add README

1b4ee20

Signed-off-by: Yu Zhentao <[email protected]>

fix kv_cache_on_host if statement and add non_blocking copy

fd29d4e

Signed-off-by: Yu Zhentao <[email protected]>

add long-context example in README

74e94ff

Signed-off-by: Yu Zhentao <[email protected]>

zhentaoyu force-pushed the cpu_sdpa branch from 4b0fa1a to 74e94ff Compare September 18, 2024 08:17

yeonsily reviewed Sep 18, 2024
